Finite-State Non-Concatenative Morphotactics 
Abstract

Finite-state morphology in the general tradition 
of the Two-Level and Xerox implementations 
has proved very successful in the production 
of robust morphological analyzer-generators, in- 
cluding many large-scale commercial systems. 
However, it has long been recognized that 
these implementations have serious limitations 
in handling non-concatenative phenomena. We 
describe a new technique for constructing finite- 
state transducers that involves reapplying the 
regular-expression compiler to its own output. 
Implemented in an algorithm called compile- 
replace, this technique has proved useful for 
handling non-concatenative phenomena; and we 
demonstrate it on Malay full-stem reduplication 
and Arabic stem interdigitation. 

1 Introduction 

Most natural languages construct words by concatenating
morphemes together in strict orders. Such "concatenative
morphotactics" can be impressively productive, especially
in agglutinative languages like Aymara (Figure 1¹) or
Turkish, and in agglutinative/polysynthetic languages like
Inuktitut (Figure 2) (Mallon, 1999, 2). In such languages a
single word may contain as many morphemes as an
average-length English sentence.

Finite-state morphology in the tradition of the Two-Level
(Koskenniemi, 1983) and Xerox implementations (Karttunen,
1991; Karttunen, 1994; Beesley and Karttunen, 2000) has been
very successful in implementing large-scale, robust and
efficient morphological analyzer-generators for
concatenative languages, including the commercially
important European languages and non-Indo-European examples
like Finnish, Turkish and Hungarian. However, Koskenniemi
himself understood that his initial implementation had
significant limitations in handling non-concatenative
morphotactic processes:

"Only restricted infixation and redu- 
plication can be handled adequately 
with the present system. Some exten- 
sions or revisions will be necessary for 
an adequate description of languages 
possessing extensive infixation or redu-
plication" (Koskenniemi, 1983, 27).



1 I wish to thank Stuart Newton for this example.



This limitation has of course not escaped the no-
tice of various reviewers, e.g. Sproat (1992). We
shall argue that the morphotactic limitations of 
the traditional implementations are the direct 
result of relying solely on the concatenation op- 
eration in morphotactic description. 

We describe a technique, within the Xerox 
implementation of finite-state morphology, that 
corrects the limitations at the source, going be- 
yond concatenation to allow the full range of 
finite-state operations to be used in morphotac- 
tic description. Regular-expression descriptions 
are compiled into finite-state automata or trans- 
ducers (collectively called networks) as usual, 
and then the compiler is re-applied to its own 
output, producing a modified but still finite- 
state network. This technique, implemented 
in an algorithm called COMPILE-REPLACE, has 
already proved useful for handling Malay full- 
stem reduplication and Arabic stem interdigi- 
tation, which will be described below. Before 
illustrating these applications, we will first out- 
line our general approach to finite-state mor- 
phology. 

2 Finite-State Morphology 
Lexical: uta+ma+na-ka+p+xa+samacha-i+wa
Surface: uta ma n ka p xa samach i wa

uta        = house (root)
+ma        = 2nd person possessive
+na        = in
-ka        = (locative, verbalizer)
+p         = plural
+xa        = perfect aspect
+samacha   = "apparently"
-i         = 3rd person
+wa        = topic marker

Figure 1: Aymara: utamankapxasamachiwa = "it appears that they are in your house"

Lexical: Paris+mut+nngau+juma+niraq+lauq+sima+nngit+junga
Surface: Pari mu nngau juma nira lauq sima nngit tunga

Paris      = (root = Paris)
+mut       = terminalis case ending
+nngau     = go (verbalizer)
+juma      = want
+niraq     = declare (that)
+lauq      = past
+sima      = (added to -lauq- indicates "distant past")
+nngit     = negative
+junga     = 1st person sing. present indic. (nonspecific)

Figure 2: Inuktitut: Parimunngaujumaniralauqsimanngittunga = "I never said I wanted to go to
Paris"

2.1 Analysis and Generation

In the most theory- and implementation-neutral form,
morphological analysis and generation of written words can
be modeled as a relation between the words themselves and
analyses of those words. Computationally, as shown in
Figure 3, a black-box module maps from words to analyses to
effect Analysis, and from analyses to words to effect
Generation.

ANALYSES

ANALYZER/GENERATOR

WORDS

Figure 3: Morphological Analysis/Generation
as a Relation between Analyses and Words

The basic claim or hope of the finite-state approach to
natural-language morphology is that relations like that
represented in Figure 3 are in fact regular relations, i.e.
relations between two regular languages. The surface
language consists of strings (= words = sequences of
symbols) written according to some defined orthography. In a
commercial application for a natural language, the surface
language to be modeled is usually a given, e.g. the set of
valid French words as written according to standard French
orthography. The lexical language again consists of strings,
but strings designed according to the needs and taste of the
linguist, representing analyses of the surface words. It is
sometimes convenient to design these lexical strings to show
all the constituent morphemes in their morphophonemic form,
separated and identified as in Figures 1 and 2. In other
applications, it may be useful to design the lexical strings
to contain the traditional dictionary citation form,
together with linguist-selected "tag" symbols like +Noun,
+Verb, +SG, +PL, that convey category, person, number,
tense, mood, case, etc. Thus the lexical string representing
paie, the first-person singular, present indicative form of
the French verb payer ("to pay"), might be spelled
payer+IndP+SG+P1+Verb. The tag symbols are stored and
manipulated just like alphabetic symbols, but they have
multicharacter print names.

[Diagram labels: Regular Expression, Compiler, Analysis Strings, Word Strings]

Figure 4: Compilation of a Regular Expression into an fst that Maps between Two Regular
Languages

If the relation is finite-state, then it can be 
defined using the metalanguage of regular ex- 
pressions; and, with a suitable compiler, the 
regular expression source code can be compiled 
into a finite-state transducer (fst), as shown in 
Figure 4, that implements the relation compu-
tationally. Following convention, we will often
refer to the upper projection of the fst, repre- 
senting analyses, as the LEXICAL language, a set 
of lexical strings; and we will refer to the lower 
projection as the surface language, consisting 
of surface strings. There are compelling advan- 
tages to computing with such finite-state ma- 
chines, including mathematical elegance, flexi- 
bility, and for most natural-language applica- 
tions, high efficiency and data-compaction. 

One computes with fsts by applying them, 
in either direction, to an input string. When 
one such fst that was written for French is ap- 
plied in an upward direction to the surface word 
maisons ("houses"), it returns the related string
maison+Fem+PL+Noun, consisting of the citation 
form and tag symbols chosen by a linguist to 
convey that the surface form is a feminine noun 
in the plural form. A single surface string can 
be related to multiple lexical strings, e.g. ap- 
plying this fst in an upward direction to sur- 
face string suis produces the four related lexical 
strings shown in Figure 5. Such ambiguity of
surface strings is very common.

etre+IndP+SG+P1+Verb
suivre+IndP+SG+P2+Verb
suivre+IndP+SG+P1+Verb
suivre+Imp+SG+P2+Verb

Figure 5: Multiple Analyses for suis 

Conversely, the very same fst can be applied 
in a downward direction to a lexical string like 
etre+IndP+SG+P1+Verb to return the related
surface string suis; such transducers are inherently
bidirectional. Ambiguity in the downward
direction is also possible, as in the relation of
the lexical string payer+IndP+SG+P1+Verb ("I
pay") to the surface strings paie and paye, which 
are in fact valid alternate spellings in standard 
French orthography. 
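
The bidirectional behavior described above can be sketched with a toy stand-in for a transducer. The sketch below is not how an fst is stored or applied; it simply lists the (lexical, surface) pairs mentioned in the text and looks them up in either direction. The names RELATION, apply_up and apply_down are hypothetical, and the tag spellings follow Figure 5.

```python
# Toy stand-in for a lexical transducer: a finite list of
# (lexical, surface) pairs taken from the examples in the text.
# A real fst encodes such a relation compactly as a network.
RELATION = [
    ("maison+Fem+PL+Noun", "maisons"),
    ("etre+IndP+SG+P1+Verb", "suis"),
    ("suivre+IndP+SG+P2+Verb", "suis"),
    ("suivre+IndP+SG+P1+Verb", "suis"),
    ("suivre+Imp+SG+P2+Verb", "suis"),
    ("payer+IndP+SG+P1+Verb", "paie"),
    ("payer+IndP+SG+P1+Verb", "paye"),
]


def apply_up(surface):
    """Analysis: every lexical string related to the surface string."""
    return [lex for lex, surf in RELATION if surf == surface]


def apply_down(lexical):
    """Generation: every surface string related to the lexical string."""
    return [surf for lex, surf in RELATION if lex == lexical]


if __name__ == "__main__":
    print(apply_up("suis"))
    print(apply_down("payer+IndP+SG+P1+Verb"))
```

Ambiguity shows up in both directions: suis has four analyses, and the lexical string for "I pay" has two surface spellings.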

2.2 Morphotactics and Alternations 

There are two challenges in modeling natural 
language morphology: 

• Morphotactics 

• Phonological/Orthographical Alternations

Finite-state morphology models both using regular
expressions. The source descriptions may also be written in
higher-level notations (e.g. lexc (Karttunen, 1993), twolc
(Karttunen and Beesley, 1992) and Replace Rules (Karttunen,
1995; Karttunen, 1996; Kempe and Karttunen, 1996)) that are
simply helpful shorthands for regular expressions and that
compile, using their dedicated compilers, into finite-state
networks. In practice, the most commonly separated modules
are a lexicon fst, containing lexical strings, and a
separately written set of rule FSTs that map from the
strings in the lexicon to properly spelled surface strings.
The lexicon description defines the morphotactics of the
language, and the rules define the alternations. The
separately compiled lexicon and rule FSTs can subsequently
be composed together as in Figure 6 to form a single
"lexical transducer" (Karttunen et al., 1992) that could
have been defined equivalently, but perhaps less
perspicuously and less efficiently, with a single regular
expression.

[Diagram labels: Regular Expression, Rule, Lexicon FST, Rule FST, Lexical Transducer (a single FST)]

Figure 6: Creation of a Lexical Transducer

In the lexical transducers built at Xerox, the strings on
the lower side of the transducer are inflected surface forms
of the language. The strings on the upper side of the
transducer contain the citation forms of each morpheme and
any number of tag symbols that indicate the inflections and
derivations of the corresponding surface form. For example,
the information that the comparative of the adjective big is
bigger might be represented in the English lexical
transducer by the path (= sequence of states and arcs) in
Figure 7, where the zeros represent epsilon symbols.² The
gemination of g and the epenthetical e in the surface form
bigger result from the composition of the original lexicon
fst with the rule fst representing the regular morphological
alternations in English.

Lexical side:   b  i  g  0  +Adj  0  +Comp
Surface side:   b  i  g  g  0     e  r

Figure 7: A Path in a Transducer for English



2 The epsilon symbols and their placement in the 
string are not significant. We will ignore them when- 
ever it is convenient. 



For the sake of clarity, Figure 7 represents the upper
(= lexical) and the lower (= surface) side of the arc label
separately on the opposite sides of the arc. In the
remaining diagrams, we use a more compact notation: the
upper and the lower symbol are combined into a single label
of the form upper:lower if the symbols are distinct. A
single symbol is used for an identity pair. In the standard
notation, the path in Figure 7 is labeled as

b i g 0:g +Adj:0 0:e +Comp:r. 
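
As a small illustration of this notation, the sketch below splits a whitespace-separated sequence of arc labels into its upper and lower strings. The function name split_path is hypothetical, and the sketch assumes that no symbol itself contains a colon and that 0 always denotes epsilon.

```python
def split_path(labels):
    """Split 'upper:lower' arc labels into upper and lower strings.

    A label without ':' is an identity pair; '0' stands for epsilon.
    Assumption: no symbol contains ':' itself.
    """
    upper, lower = [], []
    for label in labels.split():
        up, sep, low = label.partition(":")
        if not sep:  # identity pair such as 'b'
            low = up
        upper.append("" if up == "0" else up)
        lower.append("" if low == "0" else low)
    return "".join(upper), "".join(lower)


if __name__ == "__main__":
    print(split_path("b i g 0:g +Adj:0 0:e +Comp:r"))
```

Applied to the path above, the upper side reads big+Adj+Comp and the lower side reads bigger, once the epsilons are dropped.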

Lexical transducers are more efficient for analysis and
generation than the classical two-level systems
(Koskenniemi, 1983) because the morphotactics and the
morphological alternations have been precompiled and need
not be consulted at runtime. But it would be possible in
principle, and perhaps advantageous for some purposes, to
view the regular expressions defining the morphology of a
language as an uncompiled "virtual network". All the
finite-state operations (concatenation, union, intersection,
composition, etc.) can be simulated by an apply routine at
runtime.

Most languages build words by simply stringing morphemes
(prefixes, roots and suffixes) together in strict orders.
The morphotactic (word-building) processes of prefixation
and suffixation can be straightforwardly modeled in
finite-state terms as concatenation. But some natural
languages also exhibit non-concatenative morphotactics.
Sometimes the languages themselves are called
"non-concatenative languages", but most employ significant
concatenation as well, so the term "not completely
concatenative" (Lavie et al., 1988) is usually more
appropriate.

In Arabic, for example, prefixes and suffixes 
attach to stems in the usual concatenative way, 
but stems themselves are formed by a process 
known informally as interdigitation; while in
Malay, noun plurals are formed by a process 
known as full-stem reduplication. Although 
Arabic and Malay also include prefixation and 
suffixation that are modeled straightforwardly 
by concatenation, a complete lexicon cannot be 
obtained without non-concatenative processes. 

We will proceed with descriptions of how 
Malay reduplication and Semitic stem interdig- 
itation are handled in finite-state morphology 
using the new compile-replace algorithm. 

3 Compile-Replace 

The central idea in our approach to the mod- 
eling of non-concatenative processes is to de- 
fine networks using regular expressions, as be- 
fore; but we now define the strings of an in- 
termediate network so that they contain ap- 
propriate substrings that are themselves in the 
format of regular expressions. The compile- 
replace algorithm then reapplies the regular- 
expression compiler to its own output, compil- 
ing the regular-expression substrings in the in- 
termediate network and replacing them with the 
result of the compilation. 

To take a simple non-linguistic example, Figure 8
represents a network that maps the regular expression a*
into ^[a*^]; that is, the same expression enclosed between
two special delimiters, ^[ and ^], that mark it as a
regular-expression substring.




Figure 8: A Network with a Regular-Expression
Substring on the Lower Side

The application of the compile-replace algo- 
rithm to the lower side of the network elimi- 
nates the markers, compiles the regular expres- 
sion a* and maps the upper side of the path 
to the language resulting from the compilation. 
The network created by the operation is shown 
in Figure 9.

When applied in the "upward" direction, the 
transducer in Figure 9 maps any string of the
infinite a* language into the regular expression 
from which the language was compiled. 

[Diagram residue: one surviving arc label, *:0]

Figure 9: After the Application of Compile-Replace

The compile-replace algorithm is essentially a variant of a
simple recursive-descent copying routine. It expects to find
delimited regular-expression substrings on a given side
(upper or lower) of the network. Until an opening delimiter
^[ is encountered, the algorithm constructs a copy of the
path it is following. If the network contains no
regular-expression substrings, the result will be a copy of
the original network. When a ^[ is encountered, the
algorithm looks for a closing ^] and extracts the path
between the delimiters to be handled in a special way:

1. The symbols along the indicated side of the 
path are concatenated into a string and 
eliminated from the path leaving just the 
symbols on the opposite side. 

2. A separate network is created that contains 
the modified path. 

3. The extracted string is compiled into a 
second network with the standard regular- 
expression compiler. 

4. The two networks are combined into a sin- 
gle one using the crossproduct operation. 

5. The result is spliced between the states rep- 
resenting the origin and the destination of 
the regular-expression path. 

After the special treatment of the regular-expression path
is finished, normal processing is resumed in the destination
state of the closing ^] arc. For example, the result shown
in Figure 9 represents the crossproduct of the two networks
shown in Figure 10.
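
The five steps can be sketched for a single path as follows. This is a simplified illustration under stated assumptions, not the Xerox implementation: a path is a plain list of (upper, lower) symbol pairs, the empty string stands for epsilon, and Python's re module stands in for the regular-expression compiler (its syntax differs from the Xerox calculus in general, although a* means the same in both). The names compile_replace_lower and apply_up are hypothetical.

```python
import re

OPEN, CLOSE = "^[", "^]"


def compile_replace_lower(path):
    """Process one path given as a list of (upper, lower) symbol pairs.

    Returns (upper_string, lower_regex) segments whose concatenation
    describes the resulting relation. "" stands for epsilon.
    """
    segments = []
    i = 0
    while i < len(path):
        upper, lower = path[i]
        if lower != OPEN:
            # Ordinary arc: copy it; the lower symbol is a literal.
            segments.append((upper, re.escape(lower)))
            i += 1
            continue
        # Look for the closing delimiter on the lower side.
        j = i + 1
        while j < len(path) and path[j][1] != CLOSE:
            j += 1
        if j == len(path):
            raise ValueError("unterminated ^[ on the lower side")
        # Step 1: concatenate the lower symbols into one string.
        regex_source = "".join(low for _, low in path[i + 1:j])
        # Step 2: keep only the opposite (upper) side of the path.
        upper_string = "".join(up for up, _ in path[i:j + 1])
        # Step 3: compile the extracted string (here: validate it).
        re.compile(regex_source)
        # Steps 4-5: pair the upper string with the compiled language
        # and splice the pair in place of the delimited path.
        segments.append((upper_string, "(?:" + regex_source + ")"))
        i = j + 1
    return segments


def apply_up(segments, surface):
    """Map a lower-side string to the upper-side string, if related."""
    pattern = "".join(regex for _, regex in segments)
    if re.fullmatch(pattern, surface):
        return "".join(upper for upper, _ in segments)
    return None


if __name__ == "__main__":
    # Upper side a*, lower side ^[ a * ^], as in the a* example.
    path = [("", "^["), ("a", "a"), ("*", "*"), ("", "^]")]
    result = compile_replace_lower(path)
    print(apply_up(result, "aaa"))
```

Applied upward, any string of a's (including the empty string) is mapped back to the expression a*, while strings outside the language are rejected.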




Figure 10: Networks Illustrating Steps 2 and 3 
of the Compile-Replace Algorithm 

In this simple example, the upper language of the original
network in Figure 8 is identical to the regular expression
that is compiled and replaced. In the linguistic
applications presented in the next sections, the two sides
of a regular-expression path contain different strings. The
upper side contains morphological information; the
regular-expression operators appear only on the lower side
and are not present in the final result.

Lexical: b a g i +Noun +Plural
Surface: ^[ { b a g i } ^2 ^]

Lexical: pelabuhan +Noun +Plural
Surface: ^[ { pelabuhan } ^2 ^]

Figure 11: Two Paths in the Initial Malay Transducer Defined via Concatenation



3.1 Reduplication 

Traditional Two-Level implementations are already capable
of describing some limited reduplication and infixation as
in Tagalog (Antworth, 1990, 156-162). The more challenging
phenomenon is variable-length reduplication, as found in
Malay and the closely related Indonesian language.

An example of variable-length full-stem reduplication
occurs with the Malay stem bagi, which means "bag" or
"suitcase"; this form is in fact number-neutral and can
translate as the plural. Its overt plural is phonologically
bagibagi,³ formed by repeating the stem twice in a row.
Although this pluralization process may appear
concatenative, it does not involve concatenating a
predictable pluralizing morpheme, but rather copying the
preceding stem, whatever it may be and however long it may
be. Thus the overt plural of pelabuhan ("port"), itself a
derived form, is phonologically pelabuhanpelabuhan.

Productive reduplication cannot be described by
finite-state or even context-free formalisms. It is well
known that the copy language, {ww | w ∈ L}, where each word
contains two copies of the same string, is a
context-sensitive language. However, if the "base" language
L is finite, we can construct a finite-state network that
encodes L and the reduplications of all the strings in L. On
the assumption that there are only a finite number of words
subject to reduplication (no free compounding), it is
possible to construct a lexical transducer for languages
such as Malay. We will show a simple and elegant way to do
this with strictly finite-state operations.
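
The finiteness argument can be made concrete with a brute-force sketch: for a finite base language, an acceptor for the base words and their copies can be built directly as a trie. This is only meant to show that the resulting machine is finite; it is not the construction proposed in this paper, and the names build_acceptor and accepts are hypothetical.

```python
def build_acceptor(base_words):
    """Build a trie-shaped acceptor for each w in base_words and ww."""
    transitions = {}  # (state, symbol) -> state
    finals = set()
    next_state = 1    # state 0 is the start state
    for word in base_words:
        for string in (word, word + word):
            state = 0
            for symbol in string:
                key = (state, symbol)
                if key not in transitions:
                    transitions[key] = next_state
                    next_state += 1
                state = transitions[key]
            finals.add(state)
    return transitions, finals


def accepts(acceptor, string):
    """Run the deterministic acceptor over the string."""
    transitions, finals = acceptor
    state = 0
    for symbol in string:
        state = transitions.get((state, symbol))
        if state is None:
            return False
    return state in finals


if __name__ == "__main__":
    acceptor = build_acceptor(["bagi", "pelabuhan"])
    print(accepts(acceptor, "bagibagi"))
    print(accepts(acceptor, "bagipelabuhan"))
```

The number of states is bounded by the total length of the listed strings, which is why the restriction to a finite base language matters.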

To understand the general solution to full-stem
reduplication using the compile-replace algorithm requires a
bit of background. In the regular expression calculus there
are several operators that involve concatenation. For
example, if A is a regular expression denoting a language or
a relation, A* denotes zero or more and A+ denotes one or
more concatenations of A with itself. There are also
operators that express a fixed number of concatenations. In
the Xerox calculus, expressions of the form A^n, where n is
an integer, denote n concatenations of A. {abc} denotes the
concatenation of symbols a, b, and c. We also employ ^[ and
^] as delimiter symbols around regular-expression
substrings.

The reduplication of any string w can then be notated as
{w}^2, and we start by defining a network where the
lower-side strings are built by simple concatenation of a
prefix ^[, a root enclosed in braces, and an overt-plural
suffix ^2 followed by the closing ^]. Figure 11 shows the
paths for two Malay plurals in the initial network.

The compile-replace algorithm, applied to the lower side of
this network, recognizes each individual delimited
regular-expression substring like ^[{bagi}^2^], compiles it,
and replaces it with the result of the compilation, here
bagibagi. The same process applies to the entire lower-side
language, resulting in a network that relates pairs of
strings such as the ones in Figure 12. This provides the
desired solution, still finite-state, for analyzing and
generating full-stem reduplication in Malay.⁴
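
The effect on the Malay paths can be imitated with ordinary string processing. The sketch below handles only delimited substrings of the restricted form ^[{w}^n^], and "compiles" each one into the single string it denotes; a real regular-expression compiler would produce a network for an arbitrary expression. The names DELIMITED, compile_substring and compile_replace_pairs are hypothetical.

```python
import re

# One delimited substring of the restricted form ^[{w}^n^].
DELIMITED = re.compile(r"\^\[\{(\w+)\}\^(\d+)\^\]")


def compile_substring(match):
    """'Compile' {w}^n into the one string it denotes: n copies of w."""
    stem, count = match.group(1), int(match.group(2))
    return stem * count


def compile_replace_pairs(pairs):
    """Replace every delimited substring on the lower side of each pair."""
    return [
        (lexical, DELIMITED.sub(compile_substring, lower))
        for lexical, lower in pairs
    ]


# Lower-side strings as in Figure 11, written without spaces.
INITIAL = [
    ("bagi+Noun+Plural", "^[{bagi}^2^]"),
    ("pelabuhan+Noun+Plural", "^[{pelabuhan}^2^]"),
]

if __name__ == "__main__":
    for lexical, surface in compile_replace_pairs(INITIAL):
        print(lexical, "<->", surface)
```

Because the stem is copied by the ^2 operator rather than listed twice by hand, the same lexicon entry format works for stems of any length.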



3 In the standard orthography, such reduplicated 
words are written with a hyphen, e.g. bagi-bagi, that 
we will ignore for this example. 



4 It is well-known (McCarthy and Prince, 1995) that
reduplication can be a more complex phenomenon than 
it is in Malay. In some languages only a part of the stem 
is reduplicated and there may be systematic differences 
between the reduplicate and the base form. We believe 
that our approach to reduplication can account for these 
complex phenomena as well but we cannot discuss the 
Lexical: b a g i +Noun +Plural
Surface: bagibagi

Lexical: pelabuhan +Noun +Plural
Surface: pelabuhanpelabuhan

Figure 12: The Malay fst After the Application of Compile-Replace to the Lower-Side Language



The special delimiters ^[ and ^] can be
used to surround any appropriate regular- 
expression substring, using any necessary 
regular-expression operators, and compile- 
replace may be applied to the lower-side and/or 
upper-side of the network as desired. There 
is nothing to stop the linguist from inserting 
delimiters multiple times, including via compo- 
sition, and reapplying compile-replace multiple 
times (see the Appendix). The technique im- 
plemented in compile-replace is a general way 
of allowing the regular-expression compiler to 
reapply to and modify its own output. 

3.2 Semitic Stem Interdigitation 

3.2.1 Review of Earlier Work 

Much of the work in non-concatenative finite-state
morphotactics has been dedicated to handling Semitic stem
interdigitation. An example of interdigitation occurs with
the Arabic stem katab, which means "wrote". According to an
influential autosegmental analysis (McCarthy, 1981), this
stem consists of an all-consonant root ktb whose general
meaning has to do with writing, an abstract consonant-vowel
template CVCVC, and a voweling or vocalization that he
symbolized simply as a, signifying perfect aspect and active
voice. The root consonants are associated with the C slots
of the template and the vowel or vowels with the V slots,
producing a complete stem katab. If the root and the
vocalization are thought of as morphemes, neither morpheme
occurs continuously in the stem. The same root ktb can
combine with the template CVCVC and a different vocalization
ui, signifying perfect aspect and passive voice, producing
the stem kutib, which means "was written". Similarly, the
root ktb can combine with template CVVCVC and ui to produce
kuutib, the root drs can combine with CVCVC and ui to form
duris, and so forth.

Kay (1987) reformalized the autosegmental tiers of McCarthy
(1981) as projections of a multi-level transducer and wrote
a small Prolog-based prototype that handled the
interdigitation of roots, CV-templates and vocalizations
into abstract Arabic stems; this general approach, with
multi-tape transducers, has been explored and extended by
Kiraz in several papers (1994a; 1996; 1994b; 2000) with
respect to Syriac and Arabic. The implementation is
described in Kiraz and Grimley-Evans (1999).

In work more directly related to the current solution, it
was Kataja and Koskenniemi (1988) who first demonstrated
that Semitic (Akkadian) roots and patterns⁵ could be
formalized as regular languages, and that the
non-concatenative interdigitation of stems could be
elegantly formalized as the intersection of those regular
languages. Thus Akkadian words were formalized as consisting
of morphemes, some of which were combined together by
intersection and others of which were combined via
concatenation.

This was the key insight: morphotactic description could
employ various finite-state operations, not just
concatenation; and languages that required only
concatenation were just special cases. By extension, the
widely noticed limitations of early finite-state
implementations in dealing with non-concatenative
morphotactics could be traced to their dependence on the
concatenation operation in morphotactic descriptions.

This insight of Kataja and Koskenniemi was applied by
Beesley in a large-scale morphological analyzer for Arabic,
first using an implementation that simulated the
intersection of stems in code at runtime (Beesley, 1989;
Beesley et al., 1989; Beesley, 1990; Beesley, 1991), and ran
rather slowly; and later, using Xerox finite-state
technology (Beesley, 1996; Beesley, 1998a), a new
implementation that intersected the stems at compile time
and performed well at runtime.



issue here due to lack of space. 



5 These patterns combine what McCarthy (1981)
would call templates and vocalizations.
The 1996 algorithm that intersected roots and
patterns into stems, and substituted the original 
roots and patterns on just the lower side with 
the intersected stem, was admittedly rather ad 
hoc and computationally intensive, taking over 
two hours to handle about 90,000 stems on a 
SUN Ultra workstation. The compile-replace 
algorithm is a vast improvement in both gener- 
ality and efficiency, producing the same result 
in a few minutes. 

Following the lines of Kataja and Koskenniemi (1988), we
could define intermediate networks with regular-expression
substrings that indicate the intersection of suitably
encoded roots, templates, and vocalizations (for a formal
description of what such regular-expression substrings would
look like, see Beesley (1998c; 1998b)). However, the
general-purpose intersection algorithm would be expensive in
any non-trivial application, and the interdigitation of
stems represents a special case of intersection that we
achieve in practice by a much more efficient finite-state
algorithm called merge.

3.2.2 Merge 

The merge algorithm is a pattern-filling oper- 
ation that combines two regular languages, a 
template and a filler, into a single one. The 
strings of the filler language consist of ordinary 
symbols such as d, r, s, u, i. The template 
expressions may contain special class symbols 
such as C (= consonant) or V (= vowel) that 
represent a predefined set of ordinary symbols. 
The objective of the merge operation is to align 
the template strings with the filler strings and 
to instantiate the class symbols of the template 
as the matching filler symbols. 

Like intersection, the merge algorithm oper- 
ates by following two paths, one in the template 
network, the other in the filler network, and it 
constructs the corresponding single path in the 
result network. Every state in the result corre-
sponds to two original states, one in the template,
the other in the filler. If the original states are 
both final, the resulting state is also final; oth- 
erwise it is non-final. In other words, in order to 
construct a successful path, the algorithm must 
reach a final state in both of the original net- 
works. If the new path terminates in a non-final 
state, it represents a failure and will eventually 
be pruned out. 

The operation starts in the initial states of the
original networks. At each point, the algorithm
tries to find all the successful matches between 
the template arcs and filler arcs. A match is 
successful if the filler arc symbol is included in 
the class designated by the template arc sym- 
bol. The main difference between merge and 
classical intersection is in Conditions 1 and 2 
below: 

1. If a successful match is found, a new arc is 
added to the current result state. The arc 
is labeled with the filler arc symbol; its des- 
tination is the result state that corresponds 
to the two original destinations. 

2. If no successful match is found for a given 
template arc, the arc is copied into the cur- 
rent result state. Its destination is the re- 
sult state that corresponds to the destina- 
tion of the template arc and the current 
filler state. 

In effect, Condition 2 preserves any template 
arc that does not find a match. In that case, 
the path in the template network advances to 
a new state while the path in the filler network 
stays at the current state. 



We use the networks in Figure 13 to illustrate the effect
of the merge algorithm. Figure 13 shows a linear template
network and two filler networks, one of which is cyclic.



Figure 13: A Template Network and Two Filler 
Networks 

It is easy to see that the merge of the drs 
network with the template network yields the 
result shown in Figure 14. The three symbols
of the filler string are instantiated in the three 
consonant slots in the CVVCVC template. 



Figure 14: Intermediate Result. 

Lexical: k t b =Root C V C V C =Template a + =Voc
Surface: ^[ k t b .m>. C V C V C .<m. a + ^]

Lexical: k t b =Root C V C V C =Template u * i =Voc
Surface: ^[ k t b .m>. C V C V C .<m. u * i ^]

Lexical: d r s =Root C V V C V C =Template u * i =Voc
Surface: ^[ d r s .m>. C V V C V C .<m. u * i ^]

Figure 16: Initial paths

Figure 15 presents the final result in which the second
filler network in Figure 13 is merged with the intermediate
result shown in Figure 14.

Figure 15: Final Result

In this case, the filler language contains an in- 
finite set of strings, but only one successful path 
can be constructed. Because the filler string 
ends with a single i, the first two V symbols 
can be instantiated only as u. Note that ordi- 
nary symbols in the partially filled template are 
treated like the class symbols that do not find a 
match. That is, they are copied into the result 
in their current position without consuming a 
filler symbol. 
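
The merge procedure can be sketched on small networks as follows. This is an illustrative reconstruction under stated assumptions, not the Xerox code: a network is a dictionary with a start state, a set of final states, and a table of arcs; the class inventory in CLASSES is deliberately tiny; and all function names are hypothetical. One caveat: coded literally, Conditions 1 and 2 also admit paths in which a class symbol is left uninstantiated once the filler has been used up, so the sketch adds a final filter (fully_instantiated) to obtain the single result discussed above. That filter is an assumption of this sketch.

```python
# Class symbols and their members (an assumed, tiny inventory).
CLASSES = {"C": set("bdkrst"), "V": set("aiu")}


def linear(symbols):
    """Network accepting exactly one string, one arc per symbol."""
    return {
        "start": 0,
        "finals": {len(symbols)},
        "arcs": {i: [(sym, i + 1)] for i, sym in enumerate(symbols)},
    }


def merge(template, filler, classes=CLASSES):
    """Merge a filler network into a template network."""
    start = (template["start"], filler["start"])
    arcs, finals = {}, set()
    todo, seen = [start], {start}
    while todo:
        state = todo.pop()
        t_state, f_state = state
        # A result state is final only if both originals are final.
        if t_state in template["finals"] and f_state in filler["finals"]:
            finals.add(state)
        out = arcs.setdefault(state, [])
        for t_sym, t_dest in template["arcs"].get(t_state, []):
            # Ordinary template symbols have no members: never matched.
            members = classes.get(t_sym, set())
            matches = [
                (f_sym, f_dest)
                for f_sym, f_dest in filler["arcs"].get(f_state, [])
                if f_sym in members
            ]
            if matches:
                # Condition 1: instantiate with the filler symbol and
                # advance in both networks.
                new_arcs = [(f, (t_dest, d)) for f, d in matches]
            else:
                # Condition 2: copy the template arc; the filler stays.
                new_arcs = [(t_sym, (t_dest, f_state))]
            for sym, dest in new_arcs:
                out.append((sym, dest))
                if dest not in seen:
                    seen.add(dest)
                    todo.append(dest)
    return {"start": start, "finals": finals, "arcs": arcs}


def strings(net):
    """Enumerate accepted strings (assumes an acyclic network)."""
    found = set()

    def walk(state, prefix):
        if state in net["finals"]:
            found.add(prefix)
        for sym, dest in net["arcs"].get(state, []):
            walk(dest, prefix + sym)

    walk(net["start"], "")
    return found


def fully_instantiated(words, classes=CLASSES):
    """Keep only strings with no class symbol left (sketch assumption)."""
    return {w for w in words if not any(sym in classes for sym in w)}


# Cyclic filler network for u*i: loop on u, then a single i.
U_STAR_I = {"start": 0, "finals": {1}, "arcs": {0: [("u", 0), ("i", 1)]}}

step1 = merge(linear("CVVCVC"), linear("drs"))
step2 = merge(step1, U_STAR_I)

if __name__ == "__main__":
    print(strings(step1))
    print(fully_instantiated(strings(step2)))
```

Paths that end in a non-final result state are never collected by strings, which plays the role of the pruning step. Because every arc of the result advances in the template, a linear template always yields an acyclic result even when the filler is cyclic.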

To introduce the merge operation into the Xerox regular
expression calculus we need to choose an operator symbol.
Because merge, like subtraction, is a non-commutative
operation, we also must distinguish between the template and
the filler. For example, we could choose .m. as the operator
and decide by convention which of the two operands plays
which role in expressions such as [A .m. B]. What we
actually have done, perhaps without a sufficiently good
motivation, is to introduce two variants of the merge
operator, .<m. and .m>., that differ only with respect to
whether the template is to the left (.<m.) or to the right
(.m>.) of the filler. The expression [A .<m. B] represents
the same merge operation as [B .m>. A]. In both cases, A
denotes the template, B denotes the filler, and the result
is the same. With these new operators, the network in
Figure 15 can be compiled from an expression such as

drs .m>. CVVCVC .<m. u* i

As we have defined them, .<m. and .m>. are weakly binding
left-associative operators. In this example, the first merge
instantiates the filler consonants, the second operation
fills the vowel slots. However, the order in which the merge
operations are performed is irrelevant in this case because
the two filler languages do not provide competing
instantiations for the same class symbols.
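
The roles of the two operators and their left-associative evaluation can be illustrated with a string-level sketch. It is restricted to operands that are single strings (so a filler such as u* i is out of scope), it does not model operator precedence relative to the rest of the calculus, and the names merge_strings and evaluate as well as the class inventory are hypothetical.

```python
# Assumed, tiny class inventory for the illustration.
CLASSES = {"C": set("bdkrst"), "V": set("aiu")}


def merge_strings(template, filler):
    """Fill class symbols of the template, left to right, from the filler."""
    out, pos = [], 0
    for sym in template:
        members = CLASSES.get(sym, set())
        if pos < len(filler) and filler[pos] in members:
            out.append(filler[pos])  # instantiate the class symbol
            pos += 1
        else:
            out.append(sym)          # copy unmatched or ordinary symbol
    if pos != len(filler):
        raise ValueError("filler not fully consumed")
    return "".join(out)


def evaluate(expression):
    """Evaluate 'X op Y op Z ...' from left to right."""
    tokens = expression.split()
    result = tokens[0]
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        if op == ".<m.":    # template on the left, filler on the right
            result = merge_strings(result, operand)
        elif op == ".m>.":  # filler on the left, template on the right
            result = merge_strings(operand, result)
        else:
            raise ValueError("unknown operator " + repr(op))
    return result


if __name__ == "__main__":
    print(evaluate("ktb .m>. CVCVC .<m. ui"))
    print(evaluate("CVCVC .<m. ktb .<m. ui"))
```

Both expressions yield the same stem, since each names CVCVC as the template. Swapping the roles, as in ktb .<m. CVCVC, fails in this sketch because a template without class symbols cannot absorb any filler, which is one way to see that the operation is not commutative.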

3.2.3 Merging Roots and Vocalizations 
with Templates 

Following the tradition, we can represent the 
lexical forms of Arabic stems as consisting of 
three components, a consonantal root, a CV tem- 
plate and a vocalization, possibly preceded and 
followed by additional affixes. In contrast to 
McCarthy, Kay, and Kiraz, we combine the 
three components into a single projection. In a 
sense, McCarthy's three tiers are conflated into 
a single one with three distinct parts. In our 
opinion, there is no substantive difference from 
a computational point of view. 

For example, the initial lexical representation
of the surface forms katab, kutib, and duuris
may be represented as a concatenation of the
three components shown in Figure 16. We use
the symbols =Root, =Template, and =Voc to 
designate the three components of the lexical 
form. The corresponding initial surface form is 
a regular-expression substring, containing two 
merge operators, that will be compiled and re- 
placed by the interdigitated surface form. 

The application of the compile-replace opera- 
tion to the lower side of the initial lexicon yields 
a transducer that maps the Arabic interdigitated
forms directly into their corresponding tripartite
analyses and vice versa, as illustrated in
Figure 17.

Alternation rules are subsequently composed 
on the lower side of the result to map the in- 
terdigitated, but still morphophonemic, strings 
into real surface strings. 
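
The correspondence between a tripartite lexical form and its
interdigitated form, before any alternation rules apply, can be
imitated with a short sketch. This is not compile-replace itself.
The function names and the vowel inventory are hypothetical. The
vocalization is read as a Python regular expression once its spaces
are removed, which coincides with the Xerox notation only for simple
patterns such as a+ and u* i. Every slot is assumed to be filled.

```python
import itertools
import re

VOWELS = "aiu"  # assumed vowel inventory for the examples

def fill(template, cls, filler):
    """Replace the i-th occurrence of class symbol `cls` with filler[i]."""
    if template.count(cls) != len(filler):
        raise ValueError("filler length does not match the number of slots")
    symbols = iter(filler)
    return "".join(next(symbols) if s == cls else s for s in template)

def interdigitate(lexical):
    """Return the set of interdigitated strings for one tripartite form."""
    m = re.fullmatch(r"(.*)=Root(.*)=Template(.*)=Voc", lexical)
    if m is None:
        raise ValueError("expected '<root> =Root <template> =Template <voc> =Voc'")
    root, template, voc = (part.replace(" ", "") for part in m.groups())
    with_consonants = fill(template, "C", root)
    surfaces = set()
    # Try every vowel string that exactly fills the V slots and keep those
    # that belong to the vocalization language.
    for cand in itertools.product(VOWELS, repeat=template.count("V")):
        vowels = "".join(cand)
        if re.fullmatch(voc, vowels):
            surfaces.add(fill(with_consonants, "V", vowels))
    return surfaces

katab = interdigitate("k t b =Root C V C V C =Template a+ =Voc")
kutib = interdigitate("k t b =Root C V C V C =Template u* i =Voc")
duuris = interdigitate("d r s =Root C V V C V C =Template u* i =Voc")
```

Enumerating candidate vowel strings is only practical for toy
inventories; the transducer obtained by compile-replace encodes the
same pairs without any enumeration at lookup time.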

Lexical: k t b =Root C V C V C =Template a+ =Voc
Surface: k a t a b

Lexical: k t b =Root C V C V C =Template u* i =Voc
Surface: k u t i b

Lexical: d r s =Root C V V C V C =Template u* i =Voc
Surface: d u u r i s

Figure 17: After Applying Compile-Replace to the Lower Side

Although many Arabic templates are widely
considered to be pure CV-patterns, it has
been argued that certain templates also contain
"hard-wired" specific vowels and consonants.^6
For example, the so-called "Form VIII" template
is considered, by some linguists, to contain an
embedded t: CtVCVC.

The presence of ordinary symbols in the tem- 
plate does not pose any problem for the anal- 
ysis adopted here. As we already mentioned 
in discussing the intermediate representation in 
Figure 0, the merge operation treats ordinary 
symbols in a partially filled template in the 
same manner as it treats unmatched class symbols.
The merge of a root such as ktb with the
presumed Form VIII template and the a+ vocalism,



ktb .m>. CtVCVC .<m. a+ 

produces the desired result, ktatab, without 
any additional mechanism. 
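
The treatment of the hard-wired t can be traced step by step on
single strings. The sketch below is an illustration under assumed
consonant and vowel inventories; merge_trace is a hypothetical helper,
not part of the Xerox calculus, and the vocalism a+ is represented by
the one string aa that fills both vowel slots.

```python
CLASSES = {"C": set("bdkrst"), "V": set("aiu")}  # assumed inventories

def merge_trace(template, filler):
    """Return (result, trace), where trace records what happened per symbol."""
    out, trace, pos = [], [], 0
    for sym in template:
        if sym in CLASSES and pos < len(filler) and filler[pos] in CLASSES[sym]:
            out.append(filler[pos])
            trace.append((sym, "filled with " + filler[pos]))
            pos += 1  # only a successful match consumes a filler symbol
        else:
            out.append(sym)
            trace.append((sym, "copied"))
    if pos != len(filler):
        raise ValueError("filler not fully consumed")
    return "".join(out), trace

# ktb .m>. CtVCVC : the embedded t is copied, the root t fills the second C
partly_filled, trace = merge_trace("CtVCVC", "ktb")

# ... .<m. a+ : every ordinary symbol is copied, the two V slots take a
form_viii, _ = merge_trace(partly_filled, "aa")
```

The trace shows that the embedded t is copied without consuming a
root consonant, so no extra mechanism is needed in this sketch either.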

4 Status of the Implementations 

4.1 Malay Morphological 
Analyzer/Generator 

Malay and Indonesian are closely related
languages characterized by rich derivation and
little or nothing that could be called inflection.
The Malay morphological analyzer prototype,
written using lexc, Replace Rules, and
compile-replace, implements approximately 50
different derivational processes, including
prefixation, suffixation, prefix-suffix pairs
(circumfixation), reduplication, some infixation, and
combinations of these processes. Each root is 
marked manually in the source dictionary to in- 
dicate the idiosyncratic subset of derivational 
processes that it undergoes. 

The small prototype dictionary, stored in 
an XML format, contains approximately 1000 
roots, with about 1500 derivational subentries 
(i.e. an average of 1.5 derivational processes
per root). At compile time, the XML dictionary
is parsed and "downtranslated" into the
source format required for the lexc compiler. 
The XML dictionary could be expanded by any 
competent Malay lexicographer. 

4.2 Arabic Morphological 
Analyzer/Generator

The current Arabic system has been described
in some detail in previous publications (Beesley,
1996; Beesley, 1998a; Beesley, 1998b) and is
available for testing on the Internet.^7 The
modification of the system to use the compile-replace
algorithm has not changed the size or the behavior
of the system in any way, but it has reduced
the compilation time from hours to minutes.

6 See Beesley (1998c) for a discussion of this
controversial issue.

5 Conclusion 

The well-founded criticism of traditional imple- 
mentations of finite-state morphology, that they 
are limited to handling concatenative morpho- 
tactics, is a direct result of their dependence 
on the concatenation operation in morphotactic 
description. The technique described here, im- 
plemented in the compile-replace algorithm, al- 
lows the regular-expression compiler to reapply 
to and modify its own output, effectively freeing 
morphotactic description to use any finite-state 
operation. Significant experiments with Malay 
and a much larger application in Arabic have 
shown the value of this technique in handling 
two classic examples of non-concatenative mor- 
photactics: full-stem reduplication and Semitic 
stem interdigitation. Work remains to be done 
in applying the technique to other known vari- 
eties of non-concatenative morphotactics. 

The compile-replace algorithm and the merge
operator introduced in this paper are general
techniques not limited to handling the specific
morphotactic problems we have discussed. We
expect that they will have many other useful
applications. One illustration is given in the
Appendix.

7 http://www.xrce.xerox.com/research/mltt/arabic/

6 Appendix: Palindrome Extraction 

To demonstrate the power of the compile- 
replace method, let us show how it can be ap- 
plied to solve another "hard" problem: identi- 
fying and extracting all the palindromes from a 
lexicon. Like reduplication, palindrome identifi- 
cation appears at first to require more powerful 
tools than a finite-state calculus. But this task 
can be accomplished, in fact quite efficiently, by 
using the compile-replace technique. 

Let us assume that L is a simple network con- 
structed from an English wordlist. We start by 
extracting from L all the words with a property 
that is necessary but not sufficient for being a 
palindrome, namely, the words whose inverse is 
also an English word. This step can be accomplished
by redefining L as [L & L.r] where &
represents intersection and .r is the reverse
operator. The resulting network contains palindromes
such as madam as well as non-palindromes
such as dog and god.

The remaining task is to eliminate all the 
words like dog that are not identical to their 
own inverse. This can be done in three 
steps. We first apply the technique used for 
Malay reduplication. That is, we redefine L 
as "^[" "[" L XX "]" "^" 2 "^]", and apply
the compile-replace operation. At this point
the lower side of L contains strings such as
dogXXdogXX and madamXXmadamXX where XX is 
a specially introduced symbol to mark the mid- 
dle (and the end) of each string. 

The next, and somewhat delicate, step is to 
replace the XX markers by the desired opera- 
tors, intersection and reverse, and to wrap the 
special regular expression delimiters ^[ and ^]
around the whole lexicon. This can be done by
composing L with one or several replace
transducers to yield a network consisting of
expressions such as ^[ d o g & [d o g].r ^] and
^[ m a d a m & [m a d a m].r ^].

In the third and final step, the application 
of compile-replace eliminates words like dog 
because the intersection of dog with the
inverted form god is empty. Only the palindromes
survive the operation. The extraction of all the
palindromes from the 25K Unix /usr/dict/words
file by this method takes a couple of seconds.
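
The three steps can be imitated on a plain word list. The sketch
below uses Python sets in place of finite-state networks, so it shows
only the logic of the procedure and not its finite-state
implementation. The function name and the sample words are
hypothetical, and the words are assumed not to contain the marker XX.

```python
def extract_palindromes(words):
    """Imitate the three-step palindrome extraction with sets of strings."""
    lexicon = set(words)

    # Preliminary filter, L & L.r: keep words whose reverse is also a word.
    reversed_lexicon = {w[::-1] for w in lexicon}
    candidates = lexicon & reversed_lexicon

    # Step 1: double each word with the marker XX, as in dogXXdogXX.
    doubled = {w + "XX" + w + "XX" for w in candidates}

    survivors = set()
    for s in doubled:
        # Step 2: read the first marker as intersection and the second as
        # reverse, that is, w & [w].r
        left, right, _ = s.split("XX")
        # Step 3: evaluate the expression; an empty intersection drops the word.
        if {left} & {right[::-1]}:
            survivors.add(left)
    return survivors

sample = ["dog", "god", "madam", "level", "cat"]
found = extract_palindromes(sample)
```

In the sample, dog and god pass the preliminary filter but are removed
in the last step, while cat is already removed by the filter.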